1. TITLE OF THE INVENTION

Phoneme Dividing Method Using Multilevel Neural Network

2. BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a system to which the present 
invention is applied. 

FIG. 2 shows a configuration of a multilevel neural network 
used for the present invention. 

FIG. 3 is a flowchart of one embodiment of the present
invention.

* Description of the principal reference numerals
1: voice input portion
2: preprocessor
3: MLP phoneme dividing portion
4: phoneme border output portion

3. DETAILED DESCRIPTION OF THE INVENTION 

The present invention relates to a phoneme dividing method
using a multilevel neural network.

Conventional phoneme dividing technologies extract the frequency
component, that is, the spectrogram, from an acoustic signal and
then find the border of phonemes through an analysis that applies
various phonetic knowledge and rules. This makes their systems
complicated.

Without an effective and optimal method for combining the various
knowledge and rules used in phoneme division, the performance of
such a system is unreliable and deteriorates drastically when
conditions change.
There is also a method that extracts the characteristics of all
phonemes in advance, stores them as patterns, and finds the border
of phonemes by comparing these characteristic patterns with the
incoming signal. This method requires information on the
characteristic patterns of all phonemes, which undesirably
increases both the memory volume of the system and the amount of
calculation at run time.

Therefore, it is an object of the present invention to provide
a phoneme dividing method using a multilevel neural network that
precisely and efficiently captures the point of the phoneme border
using only the variation of the vocal signal appearing at the
border of phonemes, without additional knowledge of the phonemes
themselves, so that it can be utilized in application fields
requiring automatic phoneme division or phoneme labeling.

To accomplish the object of the present invention, there is
provided a phoneme dividing method using a multilevel neural
network, applied to a phoneme dividing apparatus having a voice
input portion for outputting vocal samples digitally converted
from uttered voice, a preprocessor for extracting characteristic
vectors suitable for phoneme division from the vocal samples input
from the voice input portion, a multi-layer perceptron (MLP)
phoneme dividing portion for finding and outputting the border of
phonemes using the characteristic vectors of the preprocessor, and
a phoneme border outputting portion for outputting position
information on the border of phonemes found by the MLP phoneme
dividing portion in the form of frame position, the method
comprising the steps of:

(a) sequentially segmenting and framing voice from digitized voice
samples, extracting characteristic vectors for each vocal frame,
extracting inter-frame characteristic vectors as the difference
between nearby frames of the per-frame characteristic vectors, and
normalizing the maximum and minimum of the characteristics;

(b) initializing the weights present between the input layer and
hidden layer and between the hidden layer and output layer of the
MLP, designating output target data of the MLP, inputting the
characteristic vectors to the MLP for learning, and, if the
reduction rate of the mean squared error converges within a
permissible limit, storing the weights obtained through learning
and information on the standard of the MLP and finishing the
learning; and

(c) reading the weights obtained in step (b), receiving the
characteristic vectors, performing an operation of phoneme border
discrimination to generate output values, discriminating the
phoneme border according to the output values, and, if the
currently analyzed frame has reached the frame two frames before
the final frame of the incoming voice, outputting frame numbers
indicative of the borders of phonemes as a final result.

Hereinafter, a preferred embodiment of the present invention 
will be described below. 

In FIG. 1, reference numeral 1 represents a voice input
portion, 2 a preprocessor, 3 a multi-layer perceptron (MLP)
phoneme dividing portion, and 4 a phoneme border output portion.

Voice input portion 1 comprises a microphone for converting an
airborne vocal waveform into an electric vocal signal, a band-pass
filter for eliminating low-frequency noise and high-frequency
aliasing from the vocal signal input as an electric analog signal,
and an analog-to-digital converter (ADC) for converting the analog
vocal signal into a digital vocal signal. Voice input portion 1
outputs the vocal samples, converted from the voice into digital
form, to preprocessor 2.

Preprocessor 2 extracts characteristic vectors suitable for
phoneme division from the vocal samples input from voice input
portion 1, and outputs them to MLP phoneme dividing portion 3. MLP
phoneme dividing portion 3 finds the border of phonemes, using the
characteristic vectors input from preprocessor 2, and outputs the
result to phoneme border output portion 4. Phoneme border output
portion 4 outputs position information on the phoneme borders
automatically found in MLP phoneme dividing portion 3 in the form
of frame position.

Referring to FIG. 2, one embodiment of the present invention
implements an effective and reliable automatic phoneme segmenter by
using a multi-layer perceptron (MLP), one kind of neural network,
in order to overcome the drawbacks of the conventional phoneme
dividing methods based upon knowledge or rules.

A phoneme dividing method using an MLP is well suited to avoiding
the loss of performance caused by imperfect modeling of the
knowledge or rules on the border of phonemes contained in a vocal
signal. In this method, the functions required in phoneme division
are learned automatically from the characteristic vectors extracted
from a large amount of vocal data, so that the MLP itself finds the
knowledge or rules contained in the vocal signal without specific
suppositions, rules or knowledge on the border of phonemes being
introduced in advance. Accordingly, the method of the present
invention eliminates the need to introduce uncertain suppositions,
or to apply additional processing to the distribution of the vocal
signal, in order to facilitate its modeling.

The MLP used in the present invention has a three-layer structure
of input, hidden and output layers. As shown in the drawing, the
input layer placed at the bottom has 73 input nodes: 72 input nodes
for the inter-frame characteristic vectors extracted from the four
inter-frame differences generated among five sequential frames, and
one input node for a constant input value of 1, which is used
instead of the threshold value comparison process in the hidden
layer of the MLP.

The output layer has two nodes: the first node indicates that a
frame is the border of a phoneme, and the second node indicates
that it is not. The hidden layer placed between the input and
output layers performs the nonlinear discrimination that the MLP
must actually implement.

The following nonlinear sigmoid function is used as the
activation function of the hidden layer,
y = (exp(x) - 1) / (exp(x) + 1)
where x and y represent the input and output of the activation
function, respectively.
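As an illustration only, the activation function can be sketched in Python as follows. The sketch assumes the function is the bipolar sigmoid y = (exp(x) - 1) / (exp(x) + 1), whose output lies between -1 and 1. The name sgmod is a hypothetical helper, and the tanh form is merely an algebraically equivalent way of writing it that avoids overflow for large x.

```python
import math


def sgmod(x):
    """Bipolar sigmoid y = (exp(x) - 1) / (exp(x) + 1).

    Algebraically this equals tanh(x / 2); the tanh form is used here
    (an implementation choice of this sketch) so that large |x| does
    not overflow exp().
    """
    return math.tanh(x / 2.0)


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(x, sgmod(x))
```

Because the output range is (-1, 1), this activation fits target values of 1 and -1.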

The number N of nodes of the hidden layer is known to be
closely related to the final performance of the MLP. Experiments
using various kinds of data show that a number of nodes between 10
and 30 is appropriate. Between the input layer and hidden layer and
between the hidden layer and output layer, there are weights which
connect all the nodes of the respective layers. Because the weights
connect all the nodes between the layers, their number is 73 x N
(the number of input nodes x the number of hidden nodes) between
the input layer and hidden layer, and N x 2 (the number of hidden
nodes x the number of output nodes) between the hidden layer and
output layer. These weights are obtained in advance through
learning using an error back propagation algorithm, stored in a
memory, and then read out for phoneme division.

FIG. 3 shows the procedure of the phoneme division algorithm
performed in preprocessor 2 and MLP phoneme dividing portion 3,
which has two parts: a learning process and a dividing process.

First, the process of voice framing and characteristic vector
extraction is performed in preprocessor 2 and is used in common by
the learning and dividing processes. In selecting the
characteristic vectors in the present invention, factors explicitly
indicative of the difference of spectrum between frames are
introduced in order to exploit the fact that the variation of the
vocal spectrum is severe at the border of phonemes.

For voice framing in step 10, the voice is sequentially segmented
from the digitized voice samples into lengths long enough to
extract the characteristics of the voice. Voice framing is
performed by applying Hamming windows 16 msec long every 10 msec
over all of the incoming vocal samples.
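A minimal sketch of this framing step is given here for illustration. The 16 kHz sampling rate is an assumption of the sketch (it is not stated in the text, although the 8 kHz upper band edge used later would be consistent with it), and frame_signal is a hypothetical helper name.

```python
import math


def frame_signal(samples, sample_rate=16000, frame_ms=16, shift_ms=10):
    """Cut samples into Hamming-windowed frames, 16 msec long every 10 msec.

    sample_rate=16000 is an assumption of this sketch, not a value
    given in the text. Trailing samples that do not fill a whole frame
    are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # 256 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


if __name__ == "__main__":
    one_second = [1.0] * 16000
    frames = frame_signal(one_second)
    print(len(frames), len(frames[0]))
```

With these assumed values, consecutive frames overlap by 6 msec, so each sample near a frame edge is also seen nearer the center of a neighboring frame.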

Then, the characteristic vectors are extracted from the voice
frames in step 11, which contains two substeps. In the first
substep, per-frame characteristic vectors that effectively indicate
the characteristics of voice are extracted, on the basis of
phonetic knowledge, from the respective voice frames obtained
above. In the second substep, inter-frame characteristic vectors,
that is, the differences between nearby frames of the per-frame
characteristic vectors obtained in the first substep, are extracted
to be used as the final characteristic vectors input to MLP phoneme
dividing portion 3.

For more detailed description of the above procedure, the 
characteristic vectors first obtained with respect to the 
respective frames are as follows. 

(1) frame energy: indicates the intensity of phonation by
frames and is found according to the following equation.

$$\mathrm{ENG\_FRM}(t) = \log_{10}\left(\sum_{n=0}^{N-1} s(n)\, s(n)\right)$$

where s(n) represents a vocal sample belonging to the t-th frame,
and N represents the length of the vocal frame.
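For illustration, the frame energy can be sketched as below, assuming it is the base-10 logarithm of the sum of squared samples in the frame. The floor parameter is an addition of this sketch to keep the logarithm defined for an all-zero frame; frame_energy is a hypothetical helper name.

```python
import math


def frame_energy(frame, floor=1e-10):
    """ENG_FRM(t): log10 of the sum of s(n) * s(n) over one frame.

    floor is not part of the described method; it only prevents
    log10(0) when the frame is completely silent.
    """
    energy = sum(s * s for s in frame)
    return math.log10(max(energy, floor))


if __name__ == "__main__":
    print(frame_energy([1.0] * 100))
    print(frame_energy([0.0] * 100))
```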

(2) 16th-order Mel-scaled fast Fourier transform (FFT):
First, the FFT is performed in order to obtain the spectrum, the
frequency characteristic of the voice in each frame. The frequency
components are then classified into 16 predetermined frequency
bands similar to human hearing characteristics, and the energy of
each band is used as a coefficient of the Mel-scaled FFT. The j-th
Mel-scaled FFT coefficient for frame index t, MSFC(j,t), is
obtained as this band energy from the amplitude spectrum s(j,t,f),

where f represents the frequency belonging to the respective
frequency bands;

j is the index of the respective frequency bands; and
s(j,t,f) is the amplitude spectrum of the j-th frequency band of
the t-th frame, obtained from the FFT at frequency f.

(3) energy ratio by bands: In phoneme division it is very
important to discriminate precisely between voiced and voiceless
sounds. Voiced and voiceless sounds differ in their distribution of
energy over frequency bands. In order to discriminate voiceless and
voiced sounds in the present invention, the low-frequency energy
between 0 and 3 kHz and the high-frequency energy between 3 and
8 kHz are obtained respectively, and their ratio is selected as one
of the characteristic vectors.

$$\mathrm{ENG\_RTO}(t) = \log(\mathrm{ENG\_LOW}(t)) - \log(\mathrm{ENG\_HIGH}(t))$$
$$\mathrm{ENG\_LOW}(t) = \sum_{f} s(f,t), \quad f = 0, \ldots, 3\ \mathrm{kHz}$$
$$\mathrm{ENG\_HIGH}(t) = \sum_{f} s(f,t), \quad f = 3, \ldots, 8\ \mathrm{kHz}$$

where ENG_LOW(t) and ENG_HIGH(t) are the energies of the low and
high frequency bands of the t-th voice frame, respectively, each
obtained as the sum of the components contained in the respective
band of the amplitude spectrum obtained by the FFT.
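The band energy ratio can be sketched as follows, under several stated assumptions: a 16 kHz sampling rate, a base-10 logarithm, the 3 kHz bin counted in the high band, and a small floor to keep the logarithm defined. A plain discrete Fourier transform stands in for the FFT to keep the sketch dependency-free; amplitude_spectrum and energy_ratio are hypothetical names.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Amplitude spectrum for bins 0 .. N/2 (naive DFT in place of an FFT)."""
    n = len(frame)
    return [abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n)
                    for m in range(n)))
            for k in range(n // 2 + 1)]


def energy_ratio(frame, sample_rate=16000, split_hz=3000.0, floor=1e-10):
    """ENG_RTO(t) = log(ENG_LOW(t)) - log(ENG_HIGH(t)).

    Assumptions of this sketch: log base 10, bins at exactly split_hz
    belong to the high band, and floor guards against log10(0).
    """
    n = len(frame)
    spec = amplitude_spectrum(frame)
    low = sum(a for k, a in enumerate(spec) if k * sample_rate / n < split_hz)
    high = sum(a for k, a in enumerate(spec) if k * sample_rate / n >= split_hz)
    return math.log10(max(low, floor)) - math.log10(max(high, floor))


if __name__ == "__main__":
    tone_1k = [math.sin(2 * math.pi * 1000 * m / 16000) for m in range(64)]
    tone_6k = [math.sin(2 * math.pi * 6000 * m / 16000) for m in range(64)]
    print(energy_ratio(tone_1k) > 0, energy_ratio(tone_6k) < 0)
```

A frame dominated by low-frequency energy yields a positive ratio and one dominated by high-frequency energy yields a negative ratio, which is the contrast the feature is meant to capture.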

The inter-frame characteristic vectors used as the final input
of MLP phoneme dividing portion 3 are obtained by taking the
difference between nearby frames of the per-frame characteristic
vectors above, on the basis of the fact that the vocal
characteristics change sharply at the border of phonemes.

(1) difference of frame energy between nearby frames
dENG_FRM(t) = ENG_FRM(t) - ENG_FRM(t-1)

(2) inter-frame difference of the 16th-order Mel-scaled FFT
dMSFC(j,t) = MSFC(j,t) - MSFC(j,t-1), j = 0, 1, ..., 15

Here, j represents the order of the respective coefficients.

(3) inter-frame difference of the energy ratio by bands
dENG_RTO(t) = ENG_RTO(t) - ENG_RTO(t-1)
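The three differences can be sketched together as below. Each frame contributes 1 + 16 + 1 = 18 values, so four such differences give 4 x 18 = 72 values, matching the 72 input nodes described for the MLP. The ordering of the values inside the vector and the helper names are assumptions of this sketch.

```python
def frame_features(eng_frm, msfc, eng_rto):
    """Per-frame vector: frame energy, 16 Mel-scaled FFT values, energy ratio.

    The ordering is an assumption of this sketch; the result has
    1 + 16 + 1 = 18 values.
    """
    return [eng_frm] + list(msfc) + [eng_rto]


def inter_frame_diff(features):
    """Element-wise x(t) - x(t-1) for t = 1 .. T-1 (T frames give T-1 diffs)."""
    return [[cur - prev for cur, prev in zip(features[t], features[t - 1])]
            for t in range(1, len(features))]


if __name__ == "__main__":
    feats = [
        frame_features(1.0, [0.0] * 16, 2.0),
        frame_features(3.0, [1.0] * 16, 2.5),
        frame_features(2.0, [1.0] * 16, 2.5),
    ]
    for d in inter_frame_diff(feats):
        print(len(d), d[0], d[1], d[17])
```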

After the characteristic vectors are extracted as above, they
are normalized in step 12 so that their maximum and minimum become
1 and -1, respectively, in order to be used as the input of MLP
phoneme dividing portion 3.
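One way to read this normalization is a linear min-max rescaling, sketched below. Rescaling each vector component separately, and mapping a component that never varies to 0, are assumptions of this sketch; the text only states that the maximum and minimum become 1 and -1.

```python
def normalize(vectors):
    """Rescale every component linearly so its minimum is -1 and maximum is 1.

    Per-component scaling and the 0.0 fallback for constant components
    are assumptions of this sketch.
    """
    dims = len(vectors[0])
    lo = [min(v[d] for v in vectors) for d in range(dims)]
    hi = [max(v[d] for v in vectors) for d in range(dims)]
    result = []
    for v in vectors:
        result.append([
            2.0 * (v[d] - lo[d]) / (hi[d] - lo[d]) - 1.0 if hi[d] > lo[d] else 0.0
            for d in range(dims)
        ])
    return result


if __name__ == "__main__":
    print(normalize([[0.0, 5.0], [10.0, 5.0], [5.0, 5.0]]))
```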

In the learning process of MLP phoneme dividing portion 3
using the normalized characteristic vectors, the weights present
between the input and hidden layers and between the hidden and
output layers are initialized in step 13 as the initial learning
step of MLP phoneme dividing portion 3. Each initial value is set
to an arbitrary value between -1 and 1.
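The initialization can be sketched as follows, assuming the arbitrary values are drawn uniformly from the interval between -1 and 1 (the distribution is not specified in the text). The function name, the seed parameter and the nested-list layout are choices of this sketch.

```python
import random


def init_weights(n_hidden, n_inputs=73, n_outputs=2, seed=None):
    """Return (wgt_ih, wgt_ho) filled with values between -1 and 1.

    wgt_ih[i][j] connects input node i to hidden node j (73 x N values);
    wgt_ho[j][k] connects hidden node j to output node k (N x 2 values).
    The uniform distribution is an assumption of this sketch.
    """
    rng = random.Random(seed)
    wgt_ih = [[rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
              for _ in range(n_inputs)]
    wgt_ho = [[rng.uniform(-1.0, 1.0) for _ in range(n_outputs)]
              for _ in range(n_hidden)]
    return wgt_ih, wgt_ho


if __name__ == "__main__":
    wgt_ih, wgt_ho = init_weights(20, seed=1)
    total = sum(len(row) for row in wgt_ih) + sum(len(row) for row in wgt_ho)
    print(total)  # 73 * 20 + 20 * 2 weights
```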

After this step, the output target data of the output layer, which
teaches the MLP to find the border of phonemes, is designated in
step 14. The output target data for each frame has as many elements
as there are MLP output nodes, with values of (1,-1) at the border
of a phoneme and (-1,1) otherwise. This output target data is
aligned with the frame position of the corresponding characteristic
vectors using information on the border of phonemes obtained from a
voice database that has been phoneme-divided in advance.
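A short sketch of the target designation is given here. It assumes the labeled database has already been converted into a list of border frame indices, counted from 0; make_targets is a hypothetical helper name.

```python
def make_targets(num_frames, border_frames):
    """Target pair per frame: (1, -1) at a phoneme border, (-1, 1) elsewhere.

    border_frames holds 0-based frame indices taken from a labeled
    database; the 0-based convention is an assumption of this sketch.
    """
    borders = set(border_frames)
    return [(1.0, -1.0) if t in borders else (-1.0, 1.0)
            for t in range(num_frames)]


if __name__ == "__main__":
    print(make_targets(5, [2]))
```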

After the designation of the output target data, the
characteristic vectors, which serve as the learning data, are input
to the input layer of the MLP in step 15 so as to teach the MLP in
step 16. The input layer has 73 nodes: 72 input nodes for the four
sequential inter-frame characteristic vectors, and one input node
to which 1 is input instead of the threshold value comparison
procedure of the hidden layer.

The four inter-frame characteristic vectors are extracted from the
four intervals generated by five frames, namely the currently
analyzed frame t at the center and the two preceding and two
succeeding frames t-2, t-1, t+1 and t+2, as shown in the lower
portion of FIG. 2. The learning algorithm of the phoneme dividing
MLP is the generally used error back propagation algorithm.
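The assembly of the 73-value input for one analyzed frame can be sketched as follows. The sketch assumes that diffs[i] holds the normalized 18-value difference between frame i+1 and frame i, with frames counted from 0; mlp_input is a hypothetical helper name.

```python
def mlp_input(diffs, t):
    """Build the 73 MLP inputs for analyzed frame t.

    diffs[i] is assumed to be the normalized difference x(i+1) - x(i),
    so frames t-2 .. t+2 contribute diffs[t-2], diffs[t-1], diffs[t]
    and diffs[t+1]. A constant 1 is appended as the 73rd input.
    """
    if t < 2 or t + 1 >= len(diffs):
        raise ValueError("frame t needs two neighbouring frames on each side")
    vec = []
    for i in range(t - 2, t + 2):   # the four intervals around frame t
        vec.extend(diffs[i])
    vec.append(1.0)                 # input used in place of a threshold
    return vec


if __name__ == "__main__":
    diffs = [[float(i)] * 18 for i in range(6)]   # 7 frames give 6 differences
    print(len(mlp_input(diffs, 2)))
```

With T frames, the frames that can be analyzed in this way run from index 2 to index T-3, which matches the procedure stopping two frames before the final frame.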

In this learning process of the MLP, if the reduction rate of the
mean squared error converges within a permissible limit in step 17,
the weights obtained through learning and information on the
standard of the MLP are stored in step 18 and the learning process
is finished. After the learning process, in the dividing process,
the voice is sequentially segmented from the digitized vocal
samples into lengths long enough to extract the characteristics of
the voice for voice framing in step 10, and the characteristic
vectors are extracted in step 11 and normalized in step 12.
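The stopping test of step 17 can be sketched as below. Interpreting the reduction rate as the relative decrease of the mean squared error between two successive passes over the learning data, and the default limit of 0.001, are assumptions of this sketch; the text does not give the formula or the value.

```python
def mean_squared_error(outputs, targets):
    """Mean of squared differences over all frames and output nodes."""
    total = 0.0
    count = 0
    for out, tgt in zip(outputs, targets):
        for o, g in zip(out, tgt):
            total += (o - g) ** 2
            count += 1
    return total / count


def has_converged(prev_mse, cur_mse, limit=1e-3):
    """True when the relative reduction (prev - cur) / prev is below limit.

    This reading of the reduction rate and the default limit are
    assumptions. Note that an increase of the error gives a negative
    rate, which this sketch also treats as a reason to stop.
    """
    if prev_mse <= 0.0:
        return True
    return (prev_mse - cur_mse) / prev_mse < limit


if __name__ == "__main__":
    print(has_converged(1.0, 0.5), has_converged(1.0, 0.9995))
```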

The weights obtained in the learning process are read into the
MLP in step 19. Then, the 72 characteristic values obtained in the
above process are input to the input nodes of the MLP in sequence,
and 1 is input to the final, 73rd input node in step 20.

In MLP phoneme dividing portion 3, the output value for
phoneme border discrimination is produced in step 21 through the
following MLP operation on the incoming characteristic vectors.

$$\mathrm{HID}(j) = \mathrm{SGMOD}\left(\sum_{i} \mathrm{IN}(i) \times \mathrm{WGT\_IH}(i,j)\right), \quad i = 0, 1, \ldots, 72, \quad j = 0, 1, \ldots, N-2, \quad \mathrm{HID}(N-1) = 1$$
$$\mathrm{OUT}(k) = \mathrm{SGMOD}\left(\sum_{j} \mathrm{HID}(j) \times \mathrm{WGT\_HO}(j,k)\right), \quad j = 0, 1, \ldots, N-1, \quad k = 0, 1$$

where IN(i) represents the input of the i-th input node;
HID(j) is the output of the j-th hidden node;
OUT(k) is the output of the k-th output node;
WGT_IH(i,j) is the weight connecting the i-th input node and the
j-th hidden node;
WGT_HO(j,k) is the weight connecting the j-th hidden node and
the k-th output node; and
SGMOD represents the aforementioned sigmoid function.

Value 1 is assigned to the final hidden node instead of the
threshold comparison procedure in the output layer.
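The MLP operation of step 21 can be sketched as a plain forward pass. The sketch assumes the bipolar sigmoid (written as tanh(x/2)) for SGMOD and fixes the last hidden node to 1, so the weights leading into that node are simply not used. The nested-list weight layout and the function names are choices of this sketch.

```python
import math


def sgmod(x):
    """Bipolar sigmoid (exp(x) - 1) / (exp(x) + 1), written as tanh(x / 2)."""
    return math.tanh(x / 2.0)


def mlp_forward(inputs, wgt_ih, wgt_ho):
    """Return [OUT(0), OUT(1)] for one input vector.

    wgt_ih[i][j] links input i to hidden node j; wgt_ho[j][k] links
    hidden node j to output k. The last hidden node is set to 1, so
    column N-1 of wgt_ih is ignored in this sketch.
    """
    n_hidden = len(wgt_ho)
    hid = [sgmod(sum(inputs[i] * wgt_ih[i][j] for i in range(len(inputs))))
           for j in range(n_hidden - 1)]
    hid.append(1.0)                               # HID(N-1) = 1
    return [sgmod(sum(hid[j] * wgt_ho[j][k] for j in range(n_hidden)))
            for k in range(len(wgt_ho[0]))]


if __name__ == "__main__":
    # Tiny hand-made network: 2 inputs, N = 2 hidden nodes, 2 outputs.
    wgt_ih = [[0.0, 9.9], [0.0, 9.9]]
    wgt_ho = [[0.0, 0.0], [2.0, -2.0]]
    print(mlp_forward([1.0, 1.0], wgt_ih, wgt_ho))
```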

When the output values computed in MLP phoneme dividing
portion 3 are compared in step 22 to discriminate the border of
phonemes, the analyzed frame is determined to be the border of a
phoneme if the first output value OUT(0) is positive. In contrast,
if OUT(1) is positive, the frame is determined not to be the border
of a phoneme.

In step 23, it is checked whether the currently analyzed frame
has reached the frame two frames before the final frame of the
incoming voice. If not, the procedure of inputting the
characteristic vectors to the MLP input layer is repeated. If it
has, the frame numbers indicative of the borders of phonemes are
output as the final result in step 24, and the whole procedure
ends.
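Steps 20 through 24 can be sketched as one loop over the frames. The sketch assumes frames are counted from 0, starts at the first frame that has two predecessors, and treats OUT(0) > 0 as the deciding condition (the text does not say what happens if both outputs are positive). The classify callback, which returns the two output values for a frame, is a hypothetical stand-in for the MLP operation.

```python
def find_borders(num_frames, classify):
    """Return frame numbers judged to be phoneme borders.

    classify(t) must return (OUT(0), OUT(1)) for frame t. Frames
    2 .. num_frames-3 are analyzed, so the loop ends once the frame two
    before the final frame has been handled. Using OUT(0) > 0 alone as
    the decision is an assumption of this sketch.
    """
    borders = []
    for t in range(2, num_frames - 2):
        out0, _out1 = classify(t)
        if out0 > 0.0:
            borders.append(t)
    return borders


if __name__ == "__main__":
    def fake_classify(t):
        # Stand-in for the trained MLP: borders at frames 3 and 6.
        return (1.0, -1.0) if t in (3, 6) else (-1.0, 1.0)

    print(find_borders(10, fake_classify))
```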

In implementing a voice recognition system that makes conversation
between human beings and machines possible, the present invention
operating as above divides voice in units of phonemes and thereby
provides the precise and effective phoneme division preprocessing
that is essential for phoneme recognition performed on the divided
phoneme segments. In addition, the present invention automates the
work otherwise done by voice experts in constructing the large
phoneme-divided voice databases required to implement phoneme-unit
voice recognition and voice synthesis systems. This reduces time
and cost.

Although the present invention has been described above with 
reference to the preferred embodiments thereof, those skilled in 
the art will readily appreciate that various modifications and 
substitutions can be made thereto without departing from the spirit 
and scope of the invention as set forth in the appended claims.

4. CLAIMS

1. A phoneme dividing method using a multilevel neural network
applied to a phoneme dividing apparatus having a voice input
portion for outputting a vocal sample digitally converted from
voice made, a preprocessor for extracting a characteristic vector
suitable for phoneme division, from the vocal sample input from the
voice input portion, a multi-layer perceptron (MLP) phoneme
dividing portion for finding and outputting the border of phoneme,
using the characteristic vector of the preprocessor, and a phoneme
border outputting portion for outputting position information on
the border of phoneme of the MLP phoneme dividing portion in the
form of frame position, said method comprising the steps of:

(a) sequentially segmenting and framing voice with digitalized
voice samples, extracting characteristic vectors by vocal frames,
and extracting an inter-frame characteristic vector of the
difference between nearby frames of the characteristic vectors by
frames, to thereby normalize the maximum and minimum of said
characteristics;

(b) initializing weights present between an input layer and
hidden layer and between the hidden layer and output layer of said
MLP, designating an output target data of said MLP, inputting said
characteristic vectors to said MLP for learning, and, if the
reduction rate of mean squared error converges within a
permissible limit, storing the weight obtained through learning
and information on the standard of said MLP and finishing the
learning; and

(c) reading the weight obtained in said step (b), receiving
said characteristic vectors, performing an operation of phoneme 
border discrimination to generate an output value, discriminating 
the phoneme border according to the output value, and if the 
current analyzed frame arrives two frames preceding the final frame 
of incoming voice, outputting a frame number indicative of the 
border of phoneme as a final result. 

2. The method as claimed in claim 1, wherein the voice framing 
of said step (a) is performed by taking a Hamming Window in a 
length of 16 msec every 10 msec with respect to the overall 
incoming vocal samples. 

3. The method as claimed in claim 1, wherein the phoneme
border discrimination of said step (c) is performed in such a
manner that output values generated through operation are compared,
and then it is determined that if output value OUT(0) is positive,
an analyzed frame is the border of phonemes, and if output value
OUT(1) is positive, the frame is not the border of phonemes.

ABSTRACT



A phoneme dividing method using a multilevel neural network 
applied to a phoneme dividing apparatus having a voice input 
portion, a preprocessor, a multi-layer perceptron (MLP) phoneme
dividing portion, and a phoneme border outputting portion includes 
the steps of: (a) sequentially segmenting and framing voice with 
digitalized voice samples, extracting characteristic vectors by 
vocal frames, and extracting an inter-frame characteristic vector 
of the difference between nearby frames of the characteristic 
vectors by frames, to thereby normalize the maximum and minimum of 
the characteristics; (b) storing information on the weight obtained 
through learning and the standard of the MLP; and (c) reading the 
weight obtained in the step (b) , receiving the characteristic 
vectors, performing an operation of phoneme border discrimination 
to generate an output value, discriminating the phoneme border 
according to the output value, and if the current analyzed frame 
arrives two frames preceding the final frame of incoming voice, 
outputting a frame number indicative of the border of phoneme as a 
final result. 